This report summarizes the methodology used for the capstone project of the Data Science Specialization offered by Johns Hopkins University on Coursera.
For more info, please visit Data Science Specialization.
All the code used to generate this document can be found in the code appendix at the bottom.
The objective of this first step is to explore the data, create a corpus, and begin sketching the strategy for the prediction app, which is the ultimate goal.
Many resources are available online to help build a strategy, along with a wealth of R packages.
Here is a list of resources I found particularly useful:
- A gentle intro to text mining
- Introduction to the tm package (a special package for text mining)
- Quick start with quanteda
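As a minimal illustration of the quanteda workflow covered in these resources (a toy corpus, not the capstone data; all names below exist only for this sketch):

```r
library(quanteda)

# toy corpus of three tiny "documents"
toy <- c(d1 = "The quick brown fox.",
         d2 = "The lazy dog sleeps.",
         d3 = "Foxes and dogs play together.")
corp <- corpus(toy)

# tokenize, drop punctuation, and build a document-feature matrix
toks <- tokens(corp, remove_punct = TRUE)
m <- dfm(toks)
topfeatures(m, 3)  # the most frequent features (dfm lowercases by default)
```

The same corpus → tokens → dfm pipeline scales to the full capstone data.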
For this exercise, we will use only the documents provided in English.
Here are a few features regarding the corpus of documents we have:
We have 3 documents:
- a list of tweets
- a list of blog articles
- a list of news articles
With the following characteristics:
| | Tweets | Blogs | News |
|---|---|---|---|
| Nb of lines | 2360148 | 899288 | 77259 |
| Nb of sentences (first 100 lines) | 171 | 276 | 199 |
| Average nb of characters per line | 69 | 230 | 202 |
| Nb of characters in longest line | 140 | 40833 | 5760 |
After removing the most basic English stopwords and the punctuation, here is a peek at the top words in the corpus:
And a wordcloud of the 100 most frequent words:
To create the prediction model, here are a few choices I made:
- After evaluating multiple packages, I chose the quanteda package to build the model.
- Potential foreign words and phrases are not removed from the corpus, as doing so is time-consuming and irrelevant to what we are trying to achieve here.
- Punctuation will be removed.
- Profanity will be removed, using the following document: Bad words
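As a sketch of the profanity filter, here is how quanteda's `tokens_remove()` could apply a word list (the two-word `profanity` vector below is a placeholder; the real list would be read from the bad-words file linked above):

```r
library(quanteda)

# placeholder profanity list; in the app this would come from the linked file
profanity <- c("badword1", "badword2")

toks <- tokens("this badword1 sentence is otherwise clean",
               remove_punct = TRUE)
toks <- tokens_remove(toks, pattern = profanity)
as.character(toks)  # the flagged token is gone
```

Filtering at the token stage keeps the rest of each line intact rather than discarding whole documents.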
The next steps will require:
- Exploring n-grams and deciding which "n" is the most relevant
- Optimizing the model for computational needs
- Building the Shiny app
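A quick sketch of what the n-gram exploration could look like with quanteda's `tokens_ngrams()` (toy sentence only, not the capstone corpus):

```r
library(quanteda)

toks <- tokens("to be or not to be", remove_punct = TRUE)

# compare bigrams and trigrams before settling on a value of "n"
bigrams  <- tokens_ngrams(toks, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
topfeatures(dfm(bigrams), 3)  # bigram frequencies
```

Counting n-gram frequencies like this, for several values of n, is the basis for choosing the model order.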
Stay tuned!
Find out more about how to send all code to the appendix
# load libraries
library(dplyr)
library(ggplot2)
library(quanteda)
library(data.table)
library(stopwords)
library(RColorBrewer)
library(plotly)
## loading data
twitter <- readLines("~/R/datasciencecoursera/Capstone project/final/en_US/en_US.twitter.txt",
encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("~/R/datasciencecoursera/Capstone project/final/en_US/en_US.blogs.txt",
encoding = "UTF-8", skipNul = TRUE)
news <- readLines("~/R/datasciencecoursera/Capstone project/final/en_US/en_US.news.txt",
encoding = "UTF-8", skipNul = TRUE)
tweets <- length(twitter)
twitternames <- paste0("Twitter", 1:tweets) # name docs so each source stays traceable
twittercorp <- quanteda::corpus(twitter, docnames = twitternames)
b_art <- length(blogs)
blogsnames <- paste0("Blogs", 1:b_art)
blogscorp <- quanteda::corpus(blogs, docnames = blogsnames)
news_art <- length(news)
newsnames <- paste0("News", 1:news_art)
newscorp <- quanteda::corpus(news, docnames = newsnames)
mycorpus <- twittercorp + blogscorp + newscorp
## exploring the data
# note: summary() on a corpus covers only the first 100 documents by default,
# so the sentence counts below refer to the first 100 lines of each source
twitter_sent <- sum(as.data.table(summary(twittercorp))$Sentences)
twitter_char <- round(sum(nchar(twitter))/tweets,0)
twitter_max <- max(nchar(twitter))
blogs_sent <- sum(as.data.table(summary(blogscorp))$Sentences)
blogs_char <- round(sum(nchar(blogs))/b_art,0)
blogs_max <- max(nchar(blogs))
news_sent <- sum(as.data.table(summary(newscorp))$Sentences)
news_char <- round(sum(nchar(news))/news_art,0)
news_max <- max(nchar(news))
explortable <- cbind(c(tweets,twitter_sent,twitter_char,twitter_max),c(b_art,blogs_sent,blogs_char,blogs_max),
c(news_art,news_sent,news_char,news_max))
colnames(explortable) <- c("Tweets", "Blogs", "News")
rownames(explortable) <- c("Nb of lines", "Nb of sentences (first 100 lines)", "Average nb of characters per line", "Nb of characters in longest line")
knitr::kable(explortable, caption = "Additional info about our corpus")
## exploring the data without stopwords and punctuation
# tokenize first: passing remove/remove_punct directly to dfm() is deprecated in quanteda v3
mytoks <- quanteda::tokens(mycorpus, remove_punct = TRUE)
mydfm <- quanteda::dfm(quanteda::tokens_remove(mytoks, stopwords("english")))
topwords <- data.table(Words=names(topfeatures(mydfm, 25)),Count=as.numeric(topfeatures(mydfm, 25)))
fig <- plot_ly(data = topwords, x=~Words, y = ~Count, type="bar", color = ~Count)
fig <- fig %>% layout(title="Top 25 Words")
fig
library(quanteda.textplots) # textplot_wordcloud() lives in this package since quanteda v3
textplot_wordcloud(mydfm, max_words = 100, color = RColorBrewer::brewer.pal(8, "Dark2"))